Predicting Student Dropout and Academic Success

Authors: Patricia Götz, Lana Kabbani, Noémie Glaus, Estela Gonzalez Vizcarra

Affiliation: University of Lausanne

Published: November 18, 2025

1. - Introduction

Student retention and academic success are crucial challenges for higher education institutions worldwide. Recent international observations show rising university dropout trends across multiple regions, including Australia and the United States (Sokolova, 2025). Looking closer at Europe, recent data from the German Center for Higher Education Research and Science Studies (2022) show that almost 30% of bachelor’s students in Germany leave university without graduating (Hachmeister & Berghoff, 2024). In Portugal, which is the focus of our analysis, recent data from Statistics Portugal reveal that a considerable share of young adults aged 15 to 34 (16.8%) dropped out of at least one level of education during their academic path (Europe-Data.com, 2025). Moreover, among those who dropped out, more than half (50.8%) did not complete their tertiary studies, highlighting that higher education represents a critical point of disengagement (Europe-Data.com, 2025). These figures underline the seriousness of dropout in higher education and reinforce the need for universities to rely on data-driven insights to identify at-risk students and design early intervention strategies.

We chose this topic because predicting student dropout not only helps optimize institutional resources but also supports students in achieving their academic goals. Understanding the factors that influence academic success, such as socio-economic background, previous academic performance, or family situation, can improve educational policies and personalized support systems. This subject is particularly meaningful in data science, as it allows us to combine analytical and predictive methods to better understand and prevent student dropout.

1.1 - Project Goals

The main objective of this project is to identify the factors that influence students to drop out, stay enrolled, or graduate from higher education. The dataset provides detailed information on each student’s academic performance, socioeconomic background, and demographic profile, offering a comprehensive view of the variables that shape educational outcomes. By the end of our analysis, we seek to identify the most significant combinations of academic and personal factors that influence student success. First, our analysis will focus on academic performance, examining how variables such as admission grades, semester evaluations, and course results relate to final outcomes. For instance, we will analyze whether early academic performance can serve as a reliable predictor of future dropout risk. We will then explore the influence of socioeconomic and personal factors, including parental education, occupation, and financial situation, to understand their impact on academic achievement. Lastly, the dataset will be used to build and evaluate classification models that predict students’ academic status (Dropout, Enrolled, or Graduate). In summary, this study combines exploratory analysis, visualization, and predictive modeling to generate actionable insights that help universities detect at-risk students early and strengthen academic success.

1.2 - Research Questions

    1. How do academic performance indicators and study conditions influence students’ likelihood of graduation or dropout?
    2. What is the impact of demographic and socioeconomic background on students’ probability of dropping out?
       a. To what extent do financial factors (debtor status, scholarship holder) affect student retention?
    3. Can we accurately predict a student’s final status (Dropout, Enrolled, or Graduate) based on their demographic, socioeconomic, and academic characteristics? Which are the most relevant among them?
       a. Which feature category, academic (grades, units), socioeconomic (debt, scholarship), or demographic (age, gender), contributes the most to predicting students’ dropout?

2. - Data

2.1 - Data Sourcing

The dataset is publicly available on the UCI Machine Learning Repository and was created from multiple databases of higher education institutions in Portugal. It covers students enrolled in different undergraduate programs and shows how demographic, socioeconomic, and academic factors relate to dropout. Since the data has already been collected and can be directly downloaded from UCI MLR - Predict Students’ Dropout and Academic Success - [Accessed on 20th October], there is no need to collect more data via web scraping or APIs.

2.2 - Data Description

The dataset, containing data from a Portuguese higher education institution, is provided as a CSV file of approximately 520 KB and contains detailed information about students’ demographic, academic, and socio-economic characteristics. It includes 4424 student records and 37 variables (features). After reviewing the variables, we removed two irrelevant ones, resulting in 35 variables selected for analysis.

We didn’t encounter any difficult challenges. The dataset was already clean and encoded, so we didn’t need to perform variable merging, one-hot encoding or ordinal encoding. We only had to translate categorical variables into readable labels to facilitate our visualization analysis.

2.2.1 - Data Loading

Code
# Import libraries
from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

# Load data
dataset = fetch_ucirepo(id=697)
X = np.array(dataset.data.features)
y = np.array(dataset.data.targets)

# Create dataframe
col_names = dataset.variables["name"]
df = pd.DataFrame(np.column_stack((X, y)), columns=col_names)

print(f"Dataset shape: {df.shape}")
Dataset shape: (4424, 37)

2.2.2 - Variable Selection

We selected 35 relevant variables for analysis:

Code
selected_columns = [ 
    "Marital Status", 
    "Application order", 
    "Course", 
    "Daytime/evening attendance", 
    "Previous qualification", 
    "Previous qualification (grade)",
    "Nacionality", 
    "Mother's qualification", 
    "Father's qualification", 
    "Mother's occupation", 
    "Father's occupation", 
    "Admission grade", 
    "Educational special needs", 
    "Gender", 
    "Scholarship holder", 
    "Age at enrollment", 
    "Displaced", 
    "Debtor", 
    "International", 
    "Curricular units 1st sem (credited)", 
    "Curricular units 1st sem (enrolled)", 
    "Curricular units 1st sem (evaluations)",
    "Curricular units 1st sem (approved)", 
    "Curricular units 1st sem (grade)", 
    "Curricular units 1st sem (without evaluations)", 
    "Curricular units 2nd sem (credited)", 
    "Curricular units 2nd sem (enrolled)", 
    "Curricular units 2nd sem (evaluations)", 
    "Curricular units 2nd sem (approved)", 
    "Curricular units 2nd sem (grade)", 
    "Curricular units 2nd sem (without evaluations)", 
    "Unemployment rate", 
    "Inflation rate", 
    "GDP", 
    "Target", 
]

df = df[selected_columns].copy()
print(f"Selected {len(selected_columns)} variables")
Selected 35 variables

2.2.3 - Selected Variable Descriptions

Variable | Description | Type
Marital Status | Student marital status | Categorical
Application order | Application preference order | Categorical
Course | Course taken by student | Categorical
Daytime/evening attendance | Attendance type (daytime or evening) | Categorical
Previous qualification | Type of previous qualification | Categorical
Previous qualification (grade) | Grade of previous qualification | Numerical (Continuous)
Nacionality | Student nationality | Categorical
Mother’s qualification | Educational qualification of mother | Categorical
Father’s qualification | Educational qualification of father | Categorical
Mother’s occupation | Occupation of mother | Categorical
Father’s occupation | Occupation of father | Categorical
Admission grade | Admission grade to the program | Numerical (Continuous)
Educational special needs | Whether student has special educational needs | Binary
Gender | Student gender | Binary
Scholarship holder | Whether student is a scholarship holder | Binary
Age at enrollment | Age of student at enrollment | Numerical (Discrete)
Displaced | Whether student is displaced from home | Binary
Debtor | Whether student is a debtor | Binary
International | Whether student is international | Binary
Curricular units 1st sem (credited) | Credited units in 1st semester | Numerical (Discrete)
Curricular units 1st sem (enrolled) | Enrolled units in 1st semester | Numerical (Discrete)
Curricular units 1st sem (evaluations) | Number of evaluations in 1st semester | Numerical (Discrete)
Curricular units 1st sem (approved) | Approved units in 1st semester | Numerical (Discrete)
Curricular units 1st sem (grade) | Average grade in 1st semester | Numerical (Continuous)
Curricular units 1st sem (without evaluations) | Units without evaluations in 1st semester | Numerical (Discrete)
Curricular units 2nd sem (credited) | Credited units in 2nd semester | Numerical (Discrete)
Curricular units 2nd sem (enrolled) | Enrolled units in 2nd semester | Numerical (Discrete)
Curricular units 2nd sem (evaluations) | Number of evaluations in 2nd semester | Numerical (Discrete)
Curricular units 2nd sem (approved) | Approved units in 2nd semester | Numerical (Discrete)
Curricular units 2nd sem (grade) | Average grade in 2nd semester | Numerical (Continuous)
Curricular units 2nd sem (without evaluations) | Units without evaluations in 2nd semester | Numerical (Discrete)
Unemployment rate | Unemployment rate at time of enrollment | Numerical (Continuous)
Inflation rate | Inflation rate at time of enrollment | Numerical (Continuous)
GDP | GDP at time of enrollment | Numerical (Continuous)
Target | Student status (Dropout, Enrolled, or Graduate) | Categorical



3. - Preprocessing (Data Cleaning and Wrangling)

One of the most important steps in our project is data cleaning and wrangling. After running the code to check for missing values and undefined numerical data, we found that the dataset contains no missing values and no apparent data entry mistakes.

The dataset was already encoded, and we dropped two columns, “Application mode” and “Tuition fees up to date”, because they are not relevant to our research questions. After ensuring that the numeric columns were indeed numeric, we translated categorical variables such as “Gender”, “Debtor”, “Displaced”, and “Daytime/evening attendance” into readable string labels for analysis. Although we had a well-structured and clean dataset, our main challenge was to assess its reliability: we checked for missing values, looked for data entry mistakes, and identified variables irrelevant to our analysis. We completed the cleaning work with the conversion of the categorical variables, leaving a reliable dataset ready to be analyzed.
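As a sketch of this label-translation step, the snippet below maps integer codes to readable strings. The code-to-label pairings shown are assumptions for illustration (the binary codings follow the UCI variable descriptions, but they should be verified against the dataset documentation):

```python
import pandas as pd

# Hypothetical code-to-label maps for a few binary variables; the pairings
# follow the UCI variable descriptions but should be treated as assumptions.
label_maps = {
    "Gender": {0: "Female", 1: "Male"},
    "Debtor": {0: "No", 1: "Yes"},
    "Displaced": {0: "No", 1: "Yes"},
    "Daytime/evening attendance": {0: "Evening", 1: "Daytime"},
}

def translate_labels(df, maps):
    """Return a copy of df with coded columns replaced by readable labels."""
    out = df.copy()
    for col, mapping in maps.items():
        if col in out.columns:
            out[col] = out[col].map(mapping)
    return out

# Tiny synthetic example (not the real dataset)
demo = pd.DataFrame({"Gender": [0, 1, 1], "Debtor": [1, 0, 0]})
readable = translate_labels(demo, label_maps)
```

Using `Series.map` on a dict keeps the operation vectorized and leaves any column absent from the map untouched.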

Code
def clean_dataframe(df, col_missing_thresh=0.30, row_missing_thresh=0.50):
    """Clean dataset with missing value handling."""
  
    # Count number of NaNs
    df = df.copy()
    missing = df.isna().sum()
    missing_data = missing[missing > 0]

    if len(missing_data) > 0:
        print(f"\n⚠️  Missing values found in {len(missing_data)} columns ({missing_data.sum():,} total)\n")
        display(missing_data.to_frame('Count'))
    else:
        print("\n✓ No missing values found!")

    # Drop columns with excessive missing
    col_frac = df.isna().mean()
    drop_cols = col_frac[col_frac > col_missing_thresh].index.tolist()
    if drop_cols:
        df.drop(columns=drop_cols, inplace=True)
    
    # Drop rows with excessive missing
    row_frac = df.isna().mean(axis=1)
    drop_rows = row_frac[row_frac > row_missing_thresh].index
    if len(drop_rows):
        df = df.drop(index=drop_rows).reset_index(drop=True)
    
    # Coerce columns that look numeric (per-column try/except replaces the
    # deprecated errors="ignore" option of pd.to_numeric)
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass
    
    # Impute missing values
    for col in df.select_dtypes(include=[np.number]).columns:
        if df[col].isna().any():
            df[col] = df[col].fillna(df[col].median())
    
    for col in df.select_dtypes(include=["category","object"]).columns:
        if df[col].isna().any():
            mode = df[col].mode(dropna=True)
            if not mode.empty:
                df[col] = df[col].fillna(mode.iloc[0])
    
    return df

df = clean_dataframe(df)
print(f"Shape after cleaning: {df.shape}")
print(f"Missing values: {df.isna().sum().sum()}")

✓ No missing values found!

Shape after cleaning: (4424, 35)
Missing values: 0



4. - Exploratory Data Analysis (EDA)

In this section, we explore the dataset to understand the main characteristics of the variables and how they relate to student outcomes (Dropout, Enrolled, Graduate). The goal of the EDA is to identify patterns, detect anomalies, and determine which features are most informative for predicting dropout.

4.1 - Target Variable

We begin by examining the distribution of the target variable. The three student outcomes (Dropout, Enrolled, and Graduate) are imbalanced: Graduates represent the largest group, followed by Dropouts, with a smaller proportion of Enrolled students.

Code
# Recode target
target_col = "Target"
df[target_col] = df[target_col].replace({
    0: "Dropout", 
    1: "Enrolled", 
    2: "Graduate"
})
df[target_col] = pd.Categorical(
    df[target_col], 
    categories=["Dropout", "Enrolled", "Graduate"], 
    ordered=True
)

# Visualize
fig, ax = plt.subplots(figsize=(8, 5))
target_counts = df[target_col].value_counts()
colors = ['#e74c3c', '#f39c12', '#2ecc71']
bars = ax.bar(range(len(target_counts)), target_counts.values, 
              color=colors, edgecolor='black', linewidth=1.5)

ax.set_xticks(range(len(target_counts)))
ax.set_xticklabels(target_counts.index)
ax.set_ylabel('Number of Students', fontsize=11)
ax.set_title('Student Outcomes Distribution', fontsize=13, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

# Add labels
for i, v in enumerate(target_counts.values):
    ax.text(i, v + 30, f'{v}\n({v/len(df)*100:.1f}%)', 
            ha='center', fontweight='bold')

plt.tight_layout()
plt.show()


4.2 - Correlation Analysis

Code
# Calculate correlations
corr = df.corr(numeric_only=True)

# Create heatmap
plt.figure(figsize=(16, 14))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

sns.heatmap(
    corr, 
    mask=mask,
    cmap='RdBu_r', 
    center=0,
    vmin=-1, 
    vmax=1,
    annot=True, 
    fmt='.2f',
    square=True,
    linewidths=0.5,
    cbar_kws={"shrink": 0.8, "label": "Correlation"}
)

plt.title('Correlation Matrix of Numeric Variables', 
          fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

Based on our correlation analysis, we identified several moderately and highly correlated variable pairs that indicate multicollinearity.

A high correlation can be observed between the “International” and “Nacionality” variables; we therefore chose to remove “International”, since it carries less information than “Nacionality”.

The variables “Father’s occupation” and “Mother’s occupation” are also highly correlated, but in this case the correlation reflects social structure: they describe two distinct individuals with potentially different socioeconomic effects, so we keep both. The same applies to “Mother’s qualification” and “Father’s qualification”.

Although “Curricular units 1st sem (enrolled)”/“Curricular units 2nd sem (enrolled)” and “Curricular units 1st sem (grade)”/“Curricular units 2nd sem (grade)” are each highly correlated pairs, we keep them because they capture performance progression across time periods, which is relevant for predicting dropout. In total, we excluded eight redundant semester variables plus the “International” variable, i.e., nine variables.
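Redundant pairs like these can also be flagged programmatically from the correlation matrix. The sketch below does this on a tiny synthetic frame; the 0.8 cutoff is an illustrative assumption, not the exact threshold used in our analysis:

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return variable pairs whose absolute correlation exceeds `threshold`."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return (upper.stack()
                 .loc[lambda s: s > threshold]
                 .sort_values(ascending=False))

# Tiny synthetic example: x and y are near-duplicates, z is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.05, size=200),
    "z": rng.normal(size=200),
})
pairs = correlated_pairs(demo)
```

Stacking the masked upper triangle drops the NaN entries automatically, leaving a Series indexed by (variable, variable) pairs.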


4.3 - Feature Selection

Code
# Remove highly correlated features
columns_to_remove = [
    "Curricular units 1st sem (credited)", 
    "Curricular units 1st sem (evaluations)", 
    "Curricular units 1st sem (approved)",
    "Curricular units 1st sem (without evaluations)",
    "Curricular units 2nd sem (credited)", 
    "Curricular units 2nd sem (evaluations)", 
    "Curricular units 2nd sem (approved)",
    "Curricular units 2nd sem (without evaluations)",
    "International",
]

df = df.drop(columns=columns_to_remove)
print(f"Removed {len(columns_to_remove)} highly correlated variables")
print(f"Remaining variables: {df.shape[1]}")
Removed 9 highly correlated variables
Remaining variables: 26

4.4 - Outlier Detection

We implemented a type-aware outlier detection strategy that applies different methods based on the nature of each variable:

Binary variables (e.g., Gender, Scholarship holder): Outlier detection was skipped entirely, as these variables only contain two valid values (0/1).

Nominal categorical variables (e.g., Course, Nationality): No outlier detection applied, as these represent distinct categories without natural ordering. We only reported the number of unique categories present.

Ordinal categorical variables (e.g., qualifications, occupations): We reported the number of levels but did not apply outlier detection, as these represent ordered categories rather than continuous measurements.

Grade variables (0-200 scale): We checked for values outside the valid range and additionally flagged statistical outliers with a Z-score above 3. According to the dataset documentation, grades in the Portuguese system range from 0 to 200.

Count variables (e.g., enrolled courses): We flagged values with a Z-score above 3, a deliberately lenient criterion, since count variables naturally exhibit right-skewed distributions where high values may represent legitimate cases (e.g., students enrolling in many courses).

Continuous variables (e.g., Age, GDP, Unemployment rate): We likewise flagged values with a Z-score above 3 (roughly three standard deviations from the mean) as potential outliers.

This approach ensures that outlier detection is contextually appropriate for each variable type, reducing false positives while identifying genuine data quality issues.

Code
def detect_outliers_intelligent(df, var_type_dict):
    """Detect outliers based on variable type using simple statistical rules."""
    results = []
    
    # Binary variables - skip
    print("\n Binary variables (skipping outlier detection):")
    for col in var_type_dict.get('binary', []):
        if col in df.columns:
            unique_vals = sorted(df[col].dropna().unique())
            print(f"  - {col}: values = {unique_vals}")
    
    # Nominal categorical
    print("\n Nominal Categorical (no natural order):")
    for col in var_type_dict.get('nominal', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        print(f"  - {col}: {len(series.unique())} categories")
    
    # Ordinal categorical
    print("\n Ordinal Categorical (meaningful order):")
    for col in var_type_dict.get('ordinal', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        print(f"  - {col}: {len(series.unique())} levels")
    
    # Grade variables (0-200 scale + Z-score)
    print("\n Grade variables (0-200 range + Z-score > 3):")
    for col in var_type_dict.get('grades', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        
        # Check range violations
        invalid = ((series < 0) | (series > 200)).sum()
        
        # Check statistical outliers using Z-score
        mean, std = series.mean(), series.std()
        if std > 0:
            z_scores = np.abs((series - mean) / std)
            statistical_outliers = (z_scores > 3).sum()
        else:
            statistical_outliers = 0
        
        total_outliers = invalid + statistical_outliers
        outlier_pct = 100 * total_outliers / len(series) if len(series) > 0 else 0
        
        print(f"  - {col}: {invalid} out-of-range + {statistical_outliers} extreme (Z>3) = "
              f"{total_outliers} total ({outlier_pct:.1f}%)")
        
        if total_outliers > 0:
            results.append({
                'column': col, 'type': 'grade', 
                'issue': 'out_of_range + extreme',
                'count': total_outliers, 'pct': outlier_pct
            })
    
    # Count variables (Z-score > 3)
    print("\n Count variables (Z-score > 3):")
    for col in var_type_dict.get('counts', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        if len(series) == 0:
            continue
        
        mean, std = series.mean(), series.std()
        if std > 0:
            z_scores = np.abs((series - mean) / std)
            outliers = (z_scores > 3).sum()
        else:
            outliers = 0
        
        outlier_pct = 100 * outliers / len(series)
        
        print(f"  - {col}: extreme values: {outliers} ({outlier_pct:.1f}%)")
        
        if outliers > 0:
            results.append({
                'column': col, 'type': 'count', 
                'issue': 'extreme_outlier',
                'count': outliers, 'pct': outlier_pct
            })
    
    # Continuous variables (Z-score > 3)
    print("\n Continuous variables (Z-score > 3):")
    for col in var_type_dict.get('continuous', []):
        if col not in df.columns:
            continue
        series = df[col].dropna()
        if len(series) == 0:
            continue
        
        mean, std = series.mean(), series.std()
        if std > 0:
            z_scores = np.abs((series - mean) / std)
            outliers = (z_scores > 3).sum()
        else:
            outliers = 0
        
        outlier_pct = 100 * outliers / len(series)
        
        print(f"  - {col}: extreme values: {outliers} ({outlier_pct:.1f}%)")
        
        if outliers > 0:
            results.append({
                'column': col, 'type': 'continuous', 
                'issue': 'extreme_outlier',
                'count': outliers, 'pct': outlier_pct
            })
    
    return pd.DataFrame(results)

# Define variable types
var_types = {
    'binary': [
        "Daytime/evening attendance", "Educational special needs", 
        "Gender", "Scholarship holder", "Displaced", "Debtor", "International"
    ],
    'nominal': ["Course", "Nacionality"],
    'ordinal': [
        "Marital Status", "Application mode", "Application order",
        "Previous qualification", "Mother's qualification", 
        "Father's qualification", "Mother's occupation", "Father's occupation"
    ],
    'grades': [
        "Previous qualification (grade)", "Admission grade",
        "Curricular units 1st sem (grade)", "Curricular units 2nd sem (grade)"
    ],
    'counts': [
        "Curricular units 1st sem (enrolled)",
        "Curricular units 2nd sem (enrolled)"
    ],
    'continuous': ["Age at enrollment", "Unemployment rate", "Inflation rate", "GDP"]
}

# Run outlier detection
outlier_results = detect_outliers_intelligent(df, var_types)

 Binary variables (skipping outlier detection):
  - Daytime/evening attendance: values = [np.float64(0.0), np.float64(1.0)]
  - Educational special needs: values = [np.float64(0.0), np.float64(1.0)]
  - Gender: values = [np.float64(0.0), np.float64(1.0)]
  - Scholarship holder: values = [np.float64(0.0), np.float64(1.0)]
  - Displaced: values = [np.float64(0.0), np.float64(1.0)]
  - Debtor: values = [np.float64(0.0), np.float64(1.0)]

 Nominal Categorical (no natural order):
  - Course: 17 categories
  - Nacionality: 21 categories

 Ordinal Categorical (meaningful order):
  - Marital Status: 6 levels
  - Application order: 8 levels
  - Previous qualification: 17 levels
  - Mother's qualification: 29 levels
  - Father's qualification: 34 levels
  - Mother's occupation: 32 levels
  - Father's occupation: 46 levels

 Grade variables (0-200 range + Z-score > 3):
  - Previous qualification (grade): 0 out-of-range + 21 extreme (Z>3) = 21 total (0.5%)
  - Admission grade: 0 out-of-range + 22 extreme (Z>3) = 22 total (0.5%)
  - Curricular units 1st sem (grade): 0 out-of-range + 0 extreme (Z>3) = 0 total (0.0%)
  - Curricular units 2nd sem (grade): 0 out-of-range + 0 extreme (Z>3) = 0 total (0.0%)

 Count variables (Z-score > 3):
  - Curricular units 1st sem (enrolled): extreme values: 106 (2.4%)
  - Curricular units 2nd sem (enrolled): extreme values: 82 (1.9%)

 Continuous variables (Z-score > 3):
  - Age at enrollment: extreme values: 101 (2.3%)
  - Unemployment rate: extreme values: 0 (0.0%)
  - Inflation rate: extreme values: 0 (0.0%)
  - GDP: extreme values: 0 (0.0%)

4.5 - Outlier Summary

Code
if not outlier_results.empty:
    outlier_results = outlier_results.sort_values('pct', ascending=False)
    print("\n Detected Issues:")
    display(outlier_results)  # a bare expression inside an if-block is not auto-rendered
    
    # Visualize problematic variables
    for _, row in outlier_results.iterrows():
        col = row['column']
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        
        # Histogram
        df[col].hist(bins=30, ax=ax1, edgecolor='black')
        ax1.set_title(f"Distribution")
        ax1.set_xlabel(col)
        ax1.set_ylabel("Frequency")
        ax1.grid(alpha=0.3)
        
        # Boxplot
        sns.boxplot(y=df[col], ax=ax2)
        ax2.set_title(f"Boxplot ({row['type']})")
        ax2.grid(alpha=0.3, axis='y')
        
        plt.suptitle(f"{col}: {row['count']} potential outliers ({row['pct']:.1f}%)", 
                     fontsize=12, fontweight='bold')
        plt.tight_layout()
        plt.show()
else:
    print("\n✅ No significant outliers detected!")

 Detected Issues:

We identified outliers in five different variables. Curricular units 1st sem (enrolled) and Curricular units 2nd sem (enrolled) represent the number of courses students register for each semester. We observed 106 potential outliers in the first semester and 82 in the second. Since the average course load is usually 5 to 6 classes, students taking a much higher or lower number of courses are naturally flagged as outliers. In the first semester, the highest value reaches 26 classes. Although this is an ambitious workload, it remains possible: for instance, a student trying to complete their degree quickly, or a student retaking courses after previous failures. These cases can reflect meaningful academic behaviours, so removing them would risk losing useful information. For the second semester, the maximum value is around 20 classes, leading to similar conclusions.

In both semesters, we also observe students enrolled in zero courses, which appears as an extreme value as well. This may correspond to students who completed most of their required courses earlier, or students taking a temporary break while still being officially enrolled. These profiles are still relevant and should be included. In this context, these extreme values are not problematic; on the contrary, they may help us understand whether taking unusually many, or unusually few, courses has an impact on dropout. For this reason, we decided not to remove or cap these observations.

The next variable with detected outliers is Age at enrollment, for which 101 potential outliers were identified. Since the average age at enrollment is around 20, students beginning their studies at 40 or 50 naturally appear as unusual cases. The oldest student is 70 years old, which is uncommon but not problematic for our analysis: starting university at 70 does not change the fact that this person is a student, and these cases should be included. These values represent real and meaningful student profiles, such as mature students or individuals returning to education after a long break. Excluding them would remove important diversity from the dataset and limit our understanding of the different types of students who may or may not drop out. For this reason, we chose not to remove or cap the age-related outliers.

Finally, outliers were also detected in Admission grade (22 cases) and Previous qualification (grade) (21 cases). These extreme values reflect either exceptionally high academic performance or, conversely, unusually low grades. Since they may provide insights into how prior academic achievement relates to dropout behavior, removing or capping them would not be appropriate, so we retained all outliers in these grade variables as well. Based on our research questions, we conclude that removing these outliers would not benefit our analysis: they do not represent errors but rather uncommon yet meaningful observations. Retaining them allows us to capture the full diversity of student profiles and provides a more accurate understanding of the factors that may influence dropout.
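The per-variable extremes discussed above (maximum and minimum values, number of flagged cases) can be summarised with a small helper. The snippet below is a sketch that runs on a synthetic stand-in series, not the real data:

```python
import numpy as np
import pandas as pd

def extreme_summary(df, cols, z_thresh=3.0):
    """Report min, max and the number of |Z| > z_thresh values per column."""
    rows = []
    for col in cols:
        s = df[col].dropna()
        z = np.abs((s - s.mean()) / s.std())
        rows.append({"column": col,
                     "min": s.min(),
                     "max": s.max(),
                     "n_extreme": int((z > z_thresh).sum())})
    return pd.DataFrame(rows)

# Synthetic stand-in: most students take 5-6 units, one takes 0 and one takes 26
units = pd.Series([5] * 50 + [6] * 48 + [0, 26], name="units")
summary = extreme_summary(units.to_frame(), ["units"])
```

On this toy series only the 26-unit case exceeds three standard deviations, while the zero-unit case, although unusual, stays below the threshold, mirroring how lenient Z-score flagging treats skewed count data.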

4.6 - Feature Importance Analysis

4.6.1 - Methodology

We used one-way ANOVA (Analysis of Variance) to identify which numeric variables show significant differences across the three target groups (Dropout, Enrolled, Graduate). For each variable, we calculated:

  • p-value: Statistical significance of differences between groups (α = 0.05)
  • Eta-squared (η²): Effect size measure representing the proportion of variance explained by the target variable (ranges from 0 to 1, where higher values indicate stronger association)

Variables with p-value < 0.05 are considered significantly associated with student outcomes and may be strong predictors in classification models.
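To make the two quantities concrete, here is a minimal worked example on two small synthetic groups (the values are illustrative only, not from the dataset):

```python
import numpy as np
from scipy.stats import f_oneway

# Two small synthetic groups (illustrative values only)
g1 = np.array([10.0, 12.0, 11.0])   # e.g. one outcome group
g2 = np.array([15.0, 16.0, 17.0])   # e.g. another outcome group

# p-value from a one-way ANOVA
f_val, p_val = f_oneway(g1, g2)

# Eta-squared = between-group sum of squares / total sum of squares
allv = np.concatenate([g1, g2])
grand = allv.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in (g1, g2))
ss_total = ((allv - grand) ** 2).sum()
eta_sq = ss_between / ss_total
```

Because the two group means (11 and 16) are far apart relative to the within-group spread, the p-value falls below 0.05 and eta-squared is close to 1, indicating that group membership explains most of the variance.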

Code
# ANOVA for numeric variables
from scipy.stats import f_oneway

anova_results = {}
numeric_cols = df.select_dtypes(include=np.number).columns

for col in numeric_cols:
    groups = [df.loc[df[target_col] == cat, col].dropna()
              for cat in df[target_col].cat.categories
              if cat in df[target_col].unique()]
    
    # Need at least 2 non-empty groups
    if sum(len(g) > 0 for g in groups) < 2:
        continue
    
    f_val, p_val = f_oneway(*groups)
    
    # Effect size: eta-squared
    grand_mean = df[col].mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((df[col] - grand_mean) ** 2).sum()
    eta_sq = ss_between / ss_total if ss_total > 0 else np.nan
    
    anova_results[col] = {"p_value": p_val, "eta_sq": eta_sq}

# Create results dataframe
anova_df = (pd.DataFrame(anova_results).T
            .sort_values(["p_value", "eta_sq"], ascending=[True, False]))
anova_df["significant"] = anova_df["p_value"] < 0.05

print(f"Significant variables (p < 0.05): {anova_df['significant'].sum()}")
anova_df.head(15)
Significant variables (p < 0.05): 21
p_value eta_sq significant
Curricular units 2nd sem (grade) 0.000000e+00 0.339086 True
Curricular units 1st sem (grade) 2.803052e-269 0.244020 True
Scholarship holder 4.436825e-94 0.092663 True
Age at enrollment 1.138849e-65 0.065412 True
Debtor 1.018223e-58 0.058620 True
Gender 9.950346e-53 0.052727 True
Curricular units 2nd sem (enrolled) 5.244430e-33 0.033066 True
Curricular units 1st sem (enrolled) 3.272852e-26 0.026197 True
Admission grade 4.380466e-16 0.015871 True
Displaced 2.425582e-13 0.013055 True
Previous qualification (grade) 1.077783e-12 0.012389 True
Marital Status 2.662987e-09 0.008892 True
Application order 2.955293e-09 0.008845 True
Daytime/evening attendance 5.534625e-07 0.006496 True
Mother's qualification 2.800636e-06 0.005767 True

4.7 - Top Predictive Variables

4.7.1 - Academic Performance Indicators

Our exploratory analysis shows relationships between academic performance measures and student outcomes (Dropout, Enrolled, Graduate). Several patterns emerge across admission grades, semester performance, and course load.

Code
plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Admission grade',
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Admission grade by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 1: Admission grade by student outcome

In Figure 1, we observe the distribution of the admission grade across the three categories (Dropout, Enrolled, and Graduate). Dropout students have an average admission grade of around 122, with several outliers reaching above 160. Enrolled students show a very similar average grade to Dropout students, but with fewer extreme values. Graduate students display a slightly higher average admission grade, around 125, and similarly present a few outliers above 160. Overall, the three groups show comparable distributions, with considerable overlap in their admission grades. Graduate students tend to have a marginally higher average, which may suggest that stronger academic preparation is associated with a greater likelihood of graduating. However, the presence of high admission grades in both the Dropout and Graduate categories indicates that good grades alone do not determine academic outcomes: admission grade may play a role, but it is not a decisive predictor of whether a student will graduate or drop out.
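As a rough check, per-group summaries like the averages quoted above can be computed with a simple `groupby`. The sketch below uses invented toy values, not the real dataset:

```python
# Hypothetical sketch: computing per-group admission-grade summaries.
# The toy values are illustrative only, not taken from the dataset.
import pandas as pd

toy = pd.DataFrame({
    "Target": ["Dropout", "Dropout", "Enrolled", "Enrolled",
               "Graduate", "Graduate"],
    "Admission grade": [118.0, 126.0, 120.0, 124.0, 123.0, 127.0],
})

# Mean, median, and spread of admission grades per outcome group
summary = toy.groupby("Target")["Admission grade"].agg(["mean", "median", "std"])
print(summary)
```

On the real data, the same call over the full dataframe would produce the group averages discussed in the paragraph above.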

Code
plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Previous qualification (grade)',
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Previous qualification (grade) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 2: Previous qualification grade by student outcome

Figure 2 shows the distribution of Previous qualification (grade) across the three target groups: Dropout, Enrolled, and Graduate. All three boxplots display similar characteristics, with medians around 130-133 and comparable ranges from roughly 100 to 165. Each group has multiple outliers at both the lower and upper extremes of the grade distribution, with the Dropout and Graduate groups showing the most extreme values.

Since the distributions and medians are so similar, previous qualification grade does not appear to be a strong predictor of students' outcomes. Interestingly, the Dropout group's median is quite high, indicating that students who drop out do not necessarily have lower prior grades than those who graduate or stay enrolled. The outliers show that each category contains both very high and very low grades, suggesting that factors beyond prior academic performance are at play.

Code
plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Curricular units 1st sem (grade)',
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Curricular units 1st sem (grade) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 3: First semester grade by student outcome

In Figure 3, we see the first-semester grades for the three target groups: Dropout, Enrolled, and Graduate. Dropout students show a wide range of grades. Enrolled students cluster around a median of 12.5, with moderate spread. Graduate students have the highest median, around 13.5, and a tighter distribution. The wide spread among Dropout students suggests that leaving the program is not only due to low grades. Enrolled students show average performance, indicating steady progress but not yet completion. Graduate students perform consistently better, suggesting that higher and more stable first-semester grades are associated with graduation.

Code
plt.figure(figsize=(8, 5))

sns.boxplot(
    x=target_col,
    y='Curricular units 2nd sem (grade)',
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)

plt.title(f"Curricular units 2nd sem (grade) by {target_col}", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 4: Boxplot of Grades by Target

Figure 4 reveals distinct patterns across the three groups. The Dropout category displays the widest range of performance. Enrolled students demonstrate moderate variability, with a median grade near 12. Graduates show the tightest distribution and the highest median, at approximately 13. By the second semester, the gaps between groups widen. Many dropouts received a grade of 0 (the distribution starts at 0), indicating that this is likely when they left the program. Graduates continued performing well, with consistent results around 13. Enrolled students fell somewhere in between, with decent but mixed performance. The second semester appears to be a turning point at which struggling students drop out while successful students keep their momentum.
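One way to make the "few or no units completed" observation concrete is to compute the share of zero second-semester grades per group. A minimal sketch on toy data (the column name mirrors the dataset, the values are invented):

```python
# Hedged sketch: share of students with a second-semester grade of 0,
# per outcome group. Toy data only.
import pandas as pd

toy = pd.DataFrame({
    "Target": ["Dropout"] * 4 + ["Graduate"] * 4,
    "Curricular units 2nd sem (grade)":
        [0.0, 0.0, 10.5, 12.0, 12.5, 13.0, 13.5, 14.0],
})

# Fraction of each group whose second-semester grade is exactly zero
zero_share = (toy["Curricular units 2nd sem (grade)"] == 0) \
    .groupby(toy["Target"]).mean()
print(zero_share)
```

Run on the real data, a markedly higher zero-share in the Dropout group would support the "turning point" reading above.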

Code
plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Curricular units 1st sem (enrolled)',
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Curricular units 1st sem (enrolled) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 5: First semester enrollment by student outcome

Figure 5 shows the relationship between Curricular units 1st sem (enrolled) and the three target outcomes (Dropout, Enrolled, Graduate). All three groups show similar box positions, with medians around 5-6 units. Dropouts and Enrolled students have nearly identical distributions, while Graduates sit slightly higher. All groups show numerous outliers, particularly on the upper end, with some students enrolling in 15-26 units. This suggests that the number of courses taken is not a strong differentiator of outcomes, since all groups show similar enrollment patterns. The many high outliers across all groups indicate that ambitious enrollment is common regardless of eventual outcome.

Code
plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Curricular units 2nd sem (enrolled)',
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Curricular units 2nd sem (enrolled) by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 6: Second semester enrollment by student outcome

As shown in Figure 6, Dropout and Enrolled students have similar distributions with their boxes positioned in the lower range. Graduate students show a noticeably higher box position and a wider spread. All three groups display numerous outliers, particularly on the upper end. Like the 1st semester enrollment patterns, the 2nd semester shows that graduates tend to enroll in slightly more courses, though the differences remain modest. The similar enrollment behavior between dropouts and enrolled students suggests that course load decisions in the 2nd semester don’t strongly differentiate these groups - the key difference lies in completion rates rather than enrollment ambitions.

Code
tab = (pd.crosstab(df[target_col], df['Daytime/evening attendance'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True,
        color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Daytime/evening attendance by Target', fontsize=12, fontweight='bold')
plt.legend(title='Attendance', labels=['Evening', 'Daytime'],
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Figure 7: Daytime/evening attendance by student outcome

Figure 7 shows the proportion of daytime and evening attendance within the three groups (Dropout, Enrolled, Graduate). Daytime attendance dominates across all three groups, representing approximately 85-90% of students. However, Dropout students show a slightly higher proportion of evening attendance (around 15%) compared to Enrolled and Graduate students (around 10%). This small difference might indicate that evening students face additional challenges, though the similarity across all groups suggests attendance timing is not a primary driver of dropout rates.
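For categorical variables such as attendance, a chi-square test of independence is a natural complement to the ANOVA used earlier. This sketch uses invented counts, not the dataset's actual contingency table:

```python
# Hedged sketch: chi-square test of independence between attendance mode
# and outcome. The counts below are illustrative, not the real crosstab.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Dropout, Enrolled, Graduate; columns: evening, daytime
counts = np.array([
    [210, 1211],   # Dropout
    [ 80,  714],   # Enrolled
    [220, 1989],   # Graduate
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```

A small p-value would indicate that attendance mode and outcome are not independent, even if, as the figure suggests, the practical difference in proportions is modest.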

Code
plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Application order',
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Application order by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 8: Application order by student outcome

Figure 8 shows the Application order for the three target groups. All three groups show similar distributions positioned in the lower range, with medians at approximately 1.5-2. The upper whiskers are similar across groups, reaching 3, while the lower whisker is at 0 for Graduates and around 1 for Dropout and Enrolled students. Numerous outliers appear at 4, 5, and 6, and even at 9 for the Enrolled category, indicating that some students applied to this institution as their 4th, 5th, 6th, or even 9th choice. Since the distributions are so similar across the three categories, application order does not have a strong relationship with students' success. Most students applied to this institution as their first or second choice, and the similarity of the distributions and outliers suggests that institutional preference is not a meaningful predictor of whether a student will drop out, stay enrolled, or graduate.

Code
tab = (pd.crosstab(df[target_col], df['Displaced'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True,
        color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Displaced by Target', fontsize=12, fontweight='bold')
plt.legend(title='Displaced', labels=['No', 'Yes'],
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Figure 9: Displaced status by student outcome

Figure 9 shows the proportion of displaced students (those who moved or changed residence) across the three target groups. Dropout students have the highest proportion of non-displaced students at around 53%. Enrolled students show about 45% non-displaced. Graduate students have the lowest at approximately 40% non-displaced, meaning 60% of graduates relocated. The pattern shows that students who relocated for their studies were more likely to graduate. This could be because moving demonstrates stronger commitment to education, or because staying home means dealing with work, family responsibilities, or other obligations that interfere with studying. Dropouts were the least likely to have relocated, suggesting that remaining in their original environment may have made it harder to focus on academics.

4.7.2 - Key Findings for Academic Performance and Study Conditions

Graduates have higher admission grades and previous qualification grades than dropouts, though the differences are relatively small, indicating that prior academic preparation has limited predictive power.

First-semester grades are among the strongest predictors of students' outcomes and the earliest strong warning signal available (second-semester grades show an even larger effect size in the ANOVA, but they arrive later). Students who drop out show dramatically lower grades (many between 0 and 5), while graduates consistently have higher grades (median around 12). First-semester performance is therefore a critical signal for identifying at-risk students.
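A simple early-warning rule based on this finding could flag students below a grade cutoff and compare dropout rates between flagged and unflagged groups. The cutoff and data below are hypothetical:

```python
# Hypothetical early-warning sketch: flag students whose first-semester
# grade falls below an assumed cutoff, then compare dropout rates.
import pandas as pd

toy = pd.DataFrame({
    "Curricular units 1st sem (grade)":
        [0.0, 3.5, 4.8, 11.0, 12.0, 13.0, 13.5, 14.0],
    "Target": ["Dropout", "Dropout", "Dropout", "Graduate",
               "Dropout", "Graduate", "Graduate", "Graduate"],
})

CUTOFF = 10.0  # assumed threshold, not taken from the report
at_risk = toy["Curricular units 1st sem (grade)"] < CUTOFF

# Dropout rate among flagged (True) vs non-flagged (False) students
dropout_rate = (toy["Target"] == "Dropout").groupby(at_risk).mean()
print(dropout_rate)
```

In practice, the cutoff would be tuned on the real data (e.g. via a validation split) rather than fixed a priori.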

Graduates tend to enroll in more courses in the first semester (median around 6-7) compared to those who drop out (median around 5-6), this may reflect a stronger initial academic engagement, even though this difference remains small.

Daytime/evening attendance shows an observable but modest difference. Evening students are somewhat overrepresented among dropouts (around 15% of dropouts versus about 10% of graduates), which may reflect the additional challenges faced by students balancing work or family responsibilities with their studies.

Students who are displaced have higher graduation rates (60% of graduates relocated versus around 48% of dropouts). This counter-intuitive pattern may indicate that relocating for one's studies reflects stronger commitment or independence.

4.7.3 - Demographic & Socioeconomic Background

Code
plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y='Age at enrollment',
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title('Age at enrollment by Target', fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 10: Age at enrollment by student outcome

Figure 10 shows the relationship between Age at enrollment and the three student outcomes. The Dropout group has the highest median age, approximately 23, and the widest interquartile range. The Enrolled group has a median age around 20-21, while the Graduate group shows the lowest median, around 19, and the narrowest interquartile range. All three groups contain numerous outliers, including notably older students ranging from their late 30s up to about 70 years old.

This suggests that age at enrollment is a significant predictor of students' outcomes: students who enroll at a younger age are more likely to graduate, while older students face a higher risk of dropping out. Several factors may explain this, such as younger students having fewer external responsibilities, while older students often juggle multiple commitments that can interfere with their studies. The wider distribution of the Dropout group shows that dropout can occur at any age. At the same time, some older students do successfully graduate, showing that age alone does not determine success.
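The age pattern could be made concrete by binning Age at enrollment and computing a dropout rate per bin, as in this toy sketch (cut points and values are assumptions):

```python
# Hedged sketch: dropout rate by age bracket. Toy data; the bin edges
# are arbitrary choices for illustration.
import pandas as pd

toy = pd.DataFrame({
    "Age at enrollment": [18, 19, 19, 20, 24, 28, 35, 45],
    "Target": ["Graduate", "Graduate", "Graduate", "Dropout",
               "Dropout", "Dropout", "Dropout", "Graduate"],
})

bins = [17, 21, 30, 70]               # assumed cut points
labels = ["18-21", "22-30", "31-70"]
age_group = pd.cut(toy["Age at enrollment"], bins=bins, labels=labels)

# Dropout rate within each age bracket
rate = (toy["Target"] == "Dropout").groupby(age_group, observed=True).mean()
print(rate)
```

On the real data, a rising dropout rate across the brackets would quantify the trend visible in the boxplot.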

Code
tab = (pd.crosstab(df[target_col], df['Gender'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True,
        color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Gender by Target', fontsize=12, fontweight='bold')
plt.legend(title='Gender', labels=['Female', 'Male'],
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Figure 11: Gender by student outcome

Figure 11 shows the proportion of genders within each of the three target groups (Dropout, Enrolled Target, Graduate). In the Dropout group, the proportion of male and female students is almost equal. In contrast, both the Enrolled Target and Graduate groups have a higher proportion of female students than male students.

This suggests that female students tend to persist and complete their studies at higher rates than male students. Male students appear slightly more likely to interrupt or drop out of their programs, which may contribute to the lower proportions observed in the Enrolled Target and Graduate groups.

Code
tab = (pd.crosstab(df[target_col], df['Scholarship holder'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True, 
        color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Scholarship holder by Target', fontsize=12, fontweight='bold')
plt.legend(title='Scholarship holder', labels=['No', 'Yes'], 
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Figure 12: Scholarship holder status by student outcome

Figure 12 shows the proportion of scholarship holders within each target group (Dropout, Enrolled Target, Graduate). In both the Dropout and Enrolled Target groups, the vast majority of students do not receive a scholarship, with only a small proportion being scholarship holders. In contrast, the Graduate group contains a noticeably higher proportion of scholarship recipients. This suggests that students who receive a scholarship may be more likely to graduate than those who do not. Scholarships often reduce financial pressure and provide support that may help students remain enrolled and complete their studies. Conversely, students without scholarships seem more represented among dropouts and ongoing enrollments.

Code
tab = (pd.crosstab(df[target_col], df['Debtor'])
       .apply(lambda r: r / r.sum(), axis=1))
tab = tab.reindex(columns=sorted(tab.columns.tolist()))

tab.plot(kind="bar", stacked=True,
        color=['#e74c3c', '#2ecc71'], edgecolor='black')
plt.ylabel("Proportion within target group")
plt.title('Debtor by Target', fontsize=12, fontweight='bold')
plt.legend(title='Debtor', labels=['No', 'Yes'],
         bbox_to_anchor=(1.02, 1), loc="upper left")
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Figure 13: Debtor status by student outcome

The Figure 13 shows that the dropout group has the largest proportion of students who are debtors. Enrolled students still include some debtors, but the proportion is noticeably smaller. In the graduate group, almost all students have no debt, with only a very small fraction appearing as debtors.

This trend suggests that having debt is more common among students who end up dropping out, hinting that financial pressure may contribute to early departure. Conversely, students without debt seem more likely to remain enrolled and reach graduation.

Code
plt.figure(figsize=(8, 5))

# Encode target categories as numbers
x_encoded = df[target_col].cat.codes

# Add jitter to avoid overlap
x_jitter = x_encoded + np.random.normal(0, 0.05, size=len(df))
y_jitter = df['Marital Status'] + np.random.normal(0, 0.05, size=len(df))

# Map each target category to a viridis color
colors = plt.cm.viridis(x_encoded / x_encoded.max())

plt.scatter(
    x_jitter,
    y_jitter,
    s=40,
    alpha=0.75,
    c=colors,
    edgecolor="black",
    linewidth=0.4
)

plt.xticks(range(len(df[target_col].cat.categories)), df[target_col].cat.categories)
plt.xlabel("Target")
plt.ylabel("Marital Status")
plt.title("Marital Status by Target", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 14: Marital status by student outcome

Figure 14 shows the distribution of marital status across the three target groups using a jittered scatter plot. The points align in nearly identical horizontal bands for each category, indicating that the marital status profiles are essentially the same among Dropout, Enrolled, and Graduate students. The plot clearly shows that the vast majority of students are single, which is expected given the typical age range of university students. The less common marital statuses (married, divorced, widower, facto union, and legally separated) appear only sporadically and are spread evenly across the three groups. Overall, marital status appears to have no meaningful relationship with student outcomes and offers very limited predictive value for distinguishing at-risk students. This aligns with the ANOVA results, where Marital Status ranked only twelfth, with a very small effect size (η² ≈ 0.009).

Code
plt.figure(figsize=(8, 5))
sns.boxplot(
    x=target_col,
    y="Mother's qualification",
    data=df,
    palette=['#e74c3c', '#f39c12', '#2ecc71']
)
plt.title("Mother's qualification by Target", fontsize=12, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Figure 15: Mother’s qualification by student outcome

Figure 15 shows Mother's qualification across the three target groups. The Enrolled and Graduate distributions are virtually identical, and all three groups have medians at approximately 19. The Dropout group shows a slightly wider interquartile range, extending toward lower values compared to the other two groups. The whiskers reach similar limits and there are no outliers.

While the three distributions are very similar, the Dropout group's distribution extends slightly further toward lower qualification levels, but this difference is minimal. The overall similarity indicates that a mother's qualification has little influence on whether students complete their studies.

4.7.4 - Key Findings for Demographic & Socioeconomic Background

Younger students are more likely to graduate, while older students face higher dropout risk; the dropout group also shows the widest age variation. This suggests that older students may face competing life responsibilities that interfere with their studies. Female students graduate at slightly higher rates, making up a larger proportion of both the enrolled and graduate groups, whereas the dropout group is split roughly 50-50 by gender.

Graduates include a noticeably higher proportion of scholarship recipients than dropouts and enrolled students, among whom the majority receive no scholarship; this suggests that financial support helps students complete their studies. Dropouts have the largest proportion of debtors, enrolled students fewer, and graduates almost none: financial pressure is strongly associated with dropout, while financial stability is associated with graduation. Marital status shows essentially identical distributions across the three groups, with the less common statuses appearing only as scattered outliers, so it has no apparent relationship with student outcomes and no predictive value. Mother's qualification likewise shows nearly identical distributions (medians around 19); the dropout group's slightly wider spread is minimal, so it has little to no influence on whether students complete their studies.

5 - Methods (Predictive Modelling)

In this section, we begin building a predictive model aimed at understanding which factors are most strongly associated with student dropout. Our goal is not only to classify students into the three outcome categories (Dropout, Enrolled, Graduate), but also to identify which variables contribute most to the risk of dropping out.

We train a Random Forest classifier as a first baseline model. This allows us to evaluate predictive performance and obtain a first indication of which features may be important. To further interpret and validate these results, we use LIME explanations, both at the individual level (example students) and globally across multiple samples.

This modelling part is therefore an exploratory step toward understanding dropout risk: the aim is to identify meaningful patterns, highlight influential academic or demographic factors, and evaluate which features could be most relevant for predicting student success or failure. Later, these insights can be refined and made more specific to dropout prediction.

Code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import lime.lime_tabular
import matplotlib.pyplot as plt
import seaborn as sns

# Use the dataframe after feature selection (columns already removed)
# This df already has the 8 redundant columns removed
print(f"Starting with {df.shape[1]} features (after feature selection)")

# Create working dataframe - use all columns that exist
df_model = df.copy()

# Remove rows with missing values
print(f"Dataset shape before cleaning: {df_model.shape}")
df_model = df_model.dropna()
print(f"Dataset shape after removing missing values: {df_model.shape}")

print(f"\nTarget distribution:")
print(df_model['Target'].value_counts())

# Separate features and target
X = df_model.drop('Target', axis=1)
y = df_model['Target']

print(f"\nUsing {X.shape[1]} features for classification")

# Encode categorical variables
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
label_encoders = {}

print(f"\n Encoding {len(categorical_cols)} categorical variables...")
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    label_encoders[col] = le

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTraining set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

# Train Random Forest model
print("\n Training Random Forest Classifier...")
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)

# Evaluate model
print("\n Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# Feature Importance from Random Forest
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n Top 15 Most Important Features (Random Forest):")
print(feature_importance.head(15))

# Plot feature importance
plt.figure(figsize=(10, 8))
top_15 = feature_importance.head(15)
plt.barh(range(len(top_15)), top_15['importance'])
plt.yticks(range(len(top_15)), top_15['feature'])
plt.xlabel('Importance')
plt.title('Top 15 Feature Importance (Random Forest)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# LIME Explainer
print("\n Setting up LIME explainer...")
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=X.columns.tolist(),
    class_names=[str(c) for c in sorted(y.unique())],
    mode='classification',
    random_state=42
)

# Explain a few predictions
print("\n LIME Explanations for Sample Predictions:")

# Select 3 random test samples
np.random.seed(42)
sample_indices = np.random.choice(X_test.index, size=min(3, len(X_test)), replace=False)

for idx, sample_idx in enumerate(sample_indices):
    sample = X_test.loc[sample_idx].values
    true_label = y_test.loc[sample_idx]
    pred_label = rf_model.predict(sample.reshape(1, -1))[0]
    pred_proba = rf_model.predict_proba(sample.reshape(1, -1))[0]
    
    print(f"\n--- Sample {idx + 1} ---")
    print(f"True Label: {true_label}")
    print(f"Predicted Label: {pred_label}")
    print(f"Prediction Probability: {pred_proba}")
    
    # Generate LIME explanation
    exp = explainer.explain_instance(
        sample,
        rf_model.predict_proba,
        num_features=10
    )
    
    # Show explanation
    print("\nTop 10 features influencing this prediction:")
    for feature, weight in exp.as_list():
        print(f"  {feature}: {weight:.3f}")
    
    # Plot explanation
    fig = exp.as_pyplot_figure()
    plt.tight_layout()
    plt.show()

# Global feature importance from LIME (sample-based)
print("\n Computing global LIME feature importance (sampling 100 instances)...")

# Sample instances for global importance
sample_size = min(100, len(X_test))
sample_indices_global = np.random.choice(len(X_test), size=sample_size, replace=False)

lime_weights = {feature: [] for feature in X.columns}

for i in sample_indices_global:
    exp = explainer.explain_instance(
        X_test.iloc[i].values,
        rf_model.predict_proba,
        num_features=len(X.columns)
    )
    
    for feature, weight in exp.as_list():
        # Extract feature name (LIME returns feature with value range)
        feature_name = feature.split('<=')[0].split('>')[0].split('=')[0].strip()
        # Find matching column (partial match)
        for col in X.columns:
            if col in feature_name or feature_name in col:
                lime_weights[col].append(abs(weight))
                break

# Compute average absolute weight for each feature
lime_importance = pd.DataFrame({
    'feature': list(lime_weights.keys()),
    'lime_importance': [np.mean(weights) if weights else 0 for weights in lime_weights.values()]
}).sort_values('lime_importance', ascending=False)

print("\n Top 15 Most Important Features (LIME Global):")
print(lime_importance.head(15))

# Compare Random Forest vs LIME importance
plt.figure(figsize=(12, 8))
comparison = feature_importance.merge(lime_importance, on='feature', how='left')
comparison['lime_importance'] = comparison['lime_importance'].fillna(0)
top_features = comparison.nlargest(15, 'importance')

x = np.arange(len(top_features))
width = 0.35

plt.barh(x - width/2, top_features['importance'], width, label='Random Forest', alpha=0.8)
plt.barh(x + width/2, top_features['lime_importance'], width, label='LIME', alpha=0.8)

plt.yticks(x, top_features['feature'])
plt.xlabel('Importance Score')
plt.title('Feature Importance Comparison: Random Forest vs LIME')
plt.legend()
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Starting with 26 features (after feature selection)
Dataset shape before cleaning: (4424, 26)
Dataset shape after removing missing values: (4424, 26)

Target distribution:
Target
Graduate    2209
Dropout     1421
Enrolled     794
Name: count, dtype: int64

Using 25 features for classification

 Encoding 0 categorical variables...

Training set size: 3539
Test set size: 885

 Training Random Forest Classifier...

 Model Performance:
Accuracy: 0.712

Classification Report:
              precision    recall  f1-score   support

     Dropout       0.76      0.72      0.74       284
    Enrolled       0.49      0.14      0.22       159
    Graduate       0.71      0.91      0.80       442

    accuracy                           0.71       885
   macro avg       0.65      0.59      0.58       885
weighted avg       0.68      0.71      0.67       885


 Top 15 Most Important Features (Random Forest):
                                feature  importance
21     Curricular units 2nd sem (grade)    0.236387
19     Curricular units 1st sem (grade)    0.180686
15                    Age at enrollment    0.068759
11                      Admission grade    0.052328
2                                Course    0.042544
14                   Scholarship holder    0.040479
5        Previous qualification (grade)    0.039747
18  Curricular units 1st sem (enrolled)    0.035814
17                               Debtor    0.033768
20  Curricular units 2nd sem (enrolled)    0.032511
10                  Father's occupation    0.030694
9                   Mother's occupation    0.027999
24                                  GDP    0.024098
22                    Unemployment rate    0.023999
8                Father's qualification    0.022038


 Setting up LIME explainer...

 LIME Explanations for Sample Predictions:

--- Sample 1 ---
True Label: Graduate
Predicted Label: Graduate
Prediction Probability: [0.22431332 0.15233554 0.62335114]

Top 10 features influencing this prediction:
  12.32 < Curricular units 1st sem (grade) <= 13.40: -0.032
  Age at enrollment > 25.00: -0.030
  Scholarship holder <= 0.00: 0.028
  Debtor <= 0.00: 0.021
  Curricular units 2nd sem (enrolled) > 7.00: -0.017
  Gender <= 0.00: -0.012
  19.00 < Mother's qualification <= 37.00: -0.010
  12.20 < Curricular units 2nd sem (grade) <= 13.33: 0.009
  9238.00 < Course <= 9556.00: -0.008
  19.00 < Father's qualification <= 37.00: -0.008


--- Sample 2 ---
True Label: Dropout
Predicted Label: Dropout
Prediction Probability: [0.46863318 0.2735755  0.25779133]

Top 10 features influencing this prediction:
  Curricular units 1st sem (grade) <= 11.00: 0.079
  Curricular units 2nd sem (grade) <= 10.78: -0.049
  Age at enrollment > 25.00: -0.036
  Scholarship holder <= 0.00: 0.028
  Father's occupation > 9.00: 0.024
  Mother's occupation > 9.00: 0.023
  Curricular units 1st sem (enrolled) <= 5.00: 0.020
  Curricular units 2nd sem (enrolled) <= 5.00: 0.018
  Debtor <= 0.00: 0.016
  Nacionality <= 1.00: -0.011


--- Sample 3 ---
True Label: Dropout
Predicted Label: Dropout
Prediction Probability: [0.87755488 0.1020908  0.02035431]

Top 10 features influencing this prediction:
  Curricular units 1st sem (grade) <= 11.00: 0.078
  Curricular units 2nd sem (grade) <= 10.78: -0.048
  Age at enrollment > 25.00: -0.031
  Educational special needs <= 0.00: -0.030
  Scholarship holder <= 0.00: 0.029
  Nacionality <= 1.00: -0.019
  Debtor > 0.00: -0.019
  5.00 < Curricular units 1st sem (enrolled) <= 6.00: -0.016
  Gender <= 0.00: -0.015
  Course <= 9085.00: 0.012


 Computing global LIME feature importance (sampling 100 instances)...

 Top 15 Most Important Features (LIME Global):
                                feature  lime_importance
19     Curricular units 1st sem (grade)         0.045241
21     Curricular units 2nd sem (grade)         0.031206
14                   Scholarship holder         0.027479
15                    Age at enrollment         0.018046
17                               Debtor         0.016204
6                           Nacionality         0.012869
13                               Gender         0.011791
18  Curricular units 1st sem (enrolled)         0.011203
20  Curricular units 2nd sem (enrolled)         0.010488
12            Educational special needs         0.009385
22                    Unemployment rate         0.008238
2                                Course         0.006842
4                Previous qualification         0.005702
7                Mother's qualification         0.005368
8                Father's qualification         0.005340

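The per-sample explanations above come from a LIME tabular explainer (typically the `lime` package's `LimeTabularExplainer`). As a self-contained illustration of the core idea, the sketch below fits a weighted linear surrogate around one instance: perturb the instance, query the black-box model, weight perturbations by proximity, and read local feature influences off the surrogate's coefficients. All names here are illustrative simplifications, not the project's actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Black-box model on synthetic data standing in for the student dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

def lime_like_explanation(x, model, n_samples=2000, kernel_width=2.0):
    """Local surrogate explanation for one instance (simplified LIME)."""
    # 1. Perturb the instance with Gaussian noise scaled to each feature.
    Z = x + rng.normal(scale=X.std(axis=0), size=(n_samples, x.size))
    # 2. Query the black box for the probability of its predicted class.
    cls = model.predict(x.reshape(1, -1))[0]
    p = model.predict_proba(Z)[:, cls]
    # 3. Weight perturbations by proximity to the original instance.
    d = np.linalg.norm((Z - x) / X.std(axis=0), axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # 4. Fit a weighted linear surrogate; coefficients are local influences.
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_

coefs = lime_like_explanation(X[0], model)
for i in np.argsort(-np.abs(coefs))[:5]:
    print(f"feature_{i}: {coefs[i]:+.3f}")
```

The real `lime` package additionally discretizes continuous features (which is why the output above shows interval conditions such as `12.32 < Curricular units 1st sem (grade) <= 13.40`) and selects a sparse subset of features for each explanation.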
Conclusion and Next Steps

So far in this project, we have completed the foundational steps of a rigorous analysis. We described the research background, presented relevant research questions, and prepared the dataset for analysis. We collected the dataset from the UCI Machine Learning Repository and verified its reliability through systematic preprocessing: we checked for missing values, identified anomalies (outliers), translated categorical codes into readable labels for visualization, and removed irrelevant or redundant variables. Through EDA, we analyzed academic, demographic, and socioeconomic factors and identified initial patterns associated with the “Target” variable (Dropout, Enrolled, Graduate). By exploring key insights such as first-semester performance, age at enrollment, scholarship status, and financial stability, we established the groundwork for the predictive modeling phase to be built in the following weeks.

During the upcoming weeks (November 18 to December 7, Weeks 5 to 7), we will focus on building and evaluating the predictive models. We will test candidate modeling techniques, such as decision trees and random forests, and select the most suitable one. We will assess model performance using appropriate metrics, identify the variables that contribute most to prediction accuracy, and determine which approach best predicts student outcomes with respect to our research questions. From December 8 to December 14 (Week 8), we will analyze and interpret the model results to extract the insights that answer our research questions. After evaluating the relevant variables and their effects, we will link the modeling outcomes to the patterns observed during the EDA, allowing us to provide clearer answers to our research questions. In the final week (December 15, Week 9), we will prepare the video presentation and finalize the written report in Quarto, ensuring clear and coherent findings aligned with our overall analysis.
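One way to carry out the planned model comparison is to score each candidate classifier with stratified cross-validation on a shared metric; macro F1 is a reasonable choice given the class imbalance visible in the target distribution (Graduate 2209, Dropout 1421, Enrolled 794). The sketch below is illustrative only: the models, parameters, and the synthetic stand-in data are assumptions, not decisions the team has made.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly the target's class proportions
# (Graduate ~50%, Dropout ~32%, Enrolled ~18%).
X, y = make_classification(n_samples=4424, n_features=25, n_informative=8,
                           n_classes=3, weights=[0.50, 0.32, 0.18],
                           random_state=42)

# Candidate techniques mentioned in the plan; parameters are placeholders.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=6, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, clf in candidates.items():
    # Macro F1 treats the three classes equally despite the imbalance.
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
    results[name] = scores.mean()
    print(f"{name}: macro F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Stratified folds keep the Dropout/Enrolled/Graduate proportions stable across splits, which matters most for the small Enrolled class.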

References

Europe-Data.com. (2025). One in six young people in Portugal have dropped out of education. Europe-Data.com. https://europe-data.com/one-in-six-young-people-in-portugal-have-dropped-out-of-education/
Hachmeister, C.-D., & Berghoff, S. (2024). German universities intensify measures to prevent student drop-out. CHE Centre for Higher Education. https://www.che.de/en/2024/german-universities-intensify-measures-to-prevent-student-drop-out/
Sokolova, T. (2025). Dropout rates in universities worldwide: Trends and reasons. educations.com. https://www.educations.com/higher-education-news/rising-dropout-rates-in-universities-worldwide-reasons-and-solutions